Agnostic Q-learning with Function Approximation in Deterministic Systems: Near-Optimal Bounds on Approximation Error and Sample Complexity
The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \ge 0$. We propose a novel recursion-based algorithm and show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O(\dim_E)$ trajectories, where $\rho$ is the gap between the optimal $Q$-value of the best actions and that of the second-best actions and $\dim_E$ is the Eluder dimension of $\mathcal{F}$. Our result has two implications: \begin{enumerate} \item In conjunction with the lower bound in [Du et al., 2020], our upper bound suggests that the condition $\delta = \widetilde{\Theta}\left(\rho/\sqrt{\dim_E}\right)$ is necessary and sufficient for algorithms with polynomial sample complexity. \end{enumerate} We further extend our algorithm to the stochastic reward setting and obtain similar results.
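For concreteness, one standard way to formalize the two quantities in the abstract is the following sketch (horizon indices are omitted, and the paper's own definitions may differ in such details):

$$\delta = \min_{f \in \mathcal{F}} \, \max_{(s,a)} \left| f(s,a) - Q^*(s,a) \right|, \qquad \rho = \min_{s} \left( Q^*\bigl(s, a^*(s)\bigr) - \max_{a \neq a^*(s)} Q^*(s,a) \right),$$

where $a^*(s) = \operatorname{argmax}_a Q^*(s,a)$ is the optimal action at state $s$. Read this way, the condition $\delta = O\left(\rho/\sqrt{\dim_E}\right)$ says the best sup-norm approximation of $Q^*$ within $\mathcal{F}$ must be accurate to within a constant fraction of the gap, scaled down by $\sqrt{\dim_E}$, so that the gap can absorb the accumulated approximation error.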
Review for NeurIPS paper: Agnostic Q-learning with Function Approximation in Deterministic Systems: Near-Optimal Bounds on Approximation Error and Sample Complexity
Weaknesses: The proof, as the authors themselves note, depends on the optimality-gap assumption. The relationship between the approximation error and this gap is crucial: a larger approximation error requires a larger gap to ensure the favorable properties, and it is not entirely clear whether the resulting bounds are meaningful in practice. Secondly, the algorithm for the general case requires an oracle that, given a state, determines the most uncertain action with respect to the approximation family F. While the authors argue that a similar oracle is used in previous work, it is not clear whether this is more realistic than the "Knows-What-It-Knows" oracle of Li et al. (2011), which the authors dismiss in related work. Finally, the proof applies only to deterministic systems, which significantly restricts its applicability.
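To make the reviewer's point about the oracle concrete, below is a minimal sketch of a width-based "most uncertain action" oracle in the style of Eluder-dimension analyses. This is an illustrative assumption, not the paper's implementation: it brute-forces a finite function class, and the names (version_space, most_uncertain_action, tol) are hypothetical.

# Minimal sketch of a width-based uncertainty oracle for a *finite* function
# class F. The paper assumes such an oracle abstractly; this brute-force
# version only illustrates what the oracle computes.

def version_space(F, data, tol):
    """Keep the functions in F consistent with all observed Q-values,
    up to an additive tolerance tol (e.g. the approximation error delta)."""
    return [f for f in F
            if all(abs(f(s, a) - q) <= tol for (s, a, q) in data)]

def most_uncertain_action(F_t, state, actions):
    """Return the action whose predicted Q-value varies the most across the
    surviving functions, i.e. the action with the largest 'width'."""
    def width(a):
        values = [f(state, a) for f in F_t]
        return max(values) - min(values)
    return max(actions, key=width)

# Example usage with a toy two-function class over states {0, 1}:
F = [lambda s, a: float(a), lambda s, a: float(a) * (1 - s)]
F_t = version_space(F, data=[(0, 1, 1.0)], tol=0.1)  # both functions survive
print(most_uncertain_action(F_t, state=1, actions=[0, 1]))  # -> 1

The cost of this brute-force version is linear in |F|, which is exactly why the oracle assumption matters: for rich function classes the maximization over F is nontrivial, and whether it is more realistic than a KWIK oracle is the reviewer's open question.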